617 research outputs found

    Labeling Workflow Views with Fine-Grained Dependencies

    Get PDF
    This paper considers the problem of efficiently answering reachability queries over views of provenance graphs, derived from executions of workflows that may include recursion. Such views include composite modules and model fine-grained dependencies between module inputs and outputs. A novel view-adaptive dynamic labeling scheme is developed for efficient query evaluation, in which view specifications are labeled statically (i.e. as they are created) and data items are labeled dynamically as they are produced during a workflow execution. Although the combination of fine-grained dependencies and recursive workflows entail, in general, long (linear-size) data labels, we show that for a large natural class of workflows and views, labels are compact (logarithmic-size) and reachability queries can be evaluated in constant time. Experimental results demonstrate the benefit of this approach over the state-of-the-art technique when applied for labeling multiple views.Comment: VLDB201

    Replicated Data and Partition Failures

    Get PDF
    In a distributed database system, data is often replicated to improve performance and availability. By storing copies of shared data on processors where it is frequently accessed, the need for expensive, remote read accesses is decreased. By storing copies of critical data on processors with independent failure modes, the probability that at least one copy of the data will be accessible increases. In theory, data replication makes it possible to provide arbitrarily high data availability. In practice, realizing the benefits of data replication is difficult since the correctness of data must be maintained. One important aspect of correctness with replicated data is mutual consistency: all copies of the same logical data-item must agree on exactly one current value for the data-item. Furthermore, this value should make sense in terms of the transactions executed on copies of the data-item. When communication fails between sites containing copies of the same logical data-item, mutual consistency between copies becomes complicated to ensure. The most disruptive of these communication failures are partition failures, which fragment the network into isolated subnetworks called partitions. Unless partition failures are detected and recognized by all affected processors, independent and uncoordinated updates may be applied to different copies of the data, thereby compromising the correctness of data. Consider, for example, an Airline Reservation System implemented by a distributed database which splits into two partitions when the communication network fails. If, at the time of the failure, all the nodes have one seat remaining for PAN AM 537, reservations could be made in both partitions. This would violate correctness: who should get the last seat? There should not be more seats reserved for a flight than physically exist on the plane. (Some airlines do not implement this constraint and allow overbookings.) The design of a replicated data management algorithm tolerating partition failures (or partition processing strategy) is a notoriously hard problem. Typically, the cause or extent of a partition failure cannot be discerned by the processors themselves. At best, a processor may be able to identify the other processors in its partition; but, for the processors outside of its partition, it will not be able to distinguish between the case where those processors are simply isolated from it and the case where those processors are down. In addition, slow responses can cause the network to appear partitioned even when it is not, further complicating the design of a fault-tolerant algorithm

    Scaling the Tower of Babel: Integration and Warehousing with the WWW

    Get PDF

    Search and Result Presentation in Scientific Workflow Repositories

    Get PDF
    We study the problem of searching a repository of complex hierarchical workflows whose component modules, both composite and atomic, have been annotated with keywords. Since keyword search does not use the graph structure of a workflow, we develop a model of workflows using context-free bag grammars. We then give efficient polynomial-time algorithms that, given a workflow and a keyword query, determine whether some execution of the workflow matches the query. Based on these algorithms we develop a search and ranking solution that efficiently retrieves the top-k grammars from a repository. Finally, we propose a novel result presentation method for grammars matching a keyword query, based on representative parse-trees. The effectiveness of our approach is validated through an extensive experimental evaluation

    Answering Regular Path Queries on Workflow Provenance

    Full text link
    This paper proposes a novel approach for efficiently evaluating regular path queries over provenance graphs of workflows that may include recursion. The approach assumes that an execution g of a workflow G is labeled with query-agnostic reachability labels using an existing technique. At query time, given g, G and a regular path query R, the approach decomposes R into a set of subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is rewritten so that, using the reachability labels of nodes in g, whether or not there is a path which matches Ri between two nodes can be decided in constant time. The results of each safe subquery are then composed, possibly with some small unsafe remainder, to produce an answer to R. The approach results in an algorithm that significantly reduces the number of subqueries k over existing techniques by increasing their size and complexity, and that evaluates each subquery in time bounded by its input and output size. Experimental results demonstrate the benefit of this approach

    Provenance Views for Module Privacy

    Get PDF
    Scientific workflow systems increasingly store provenance information about the module executions used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions. However, authors/owners of workflows may wish to keep some of this information confidential. In particular, a module may be proprietary, and users should not be able to infer its behavior by seeing mappings between all data inputs and outputs. The problem we address in this paper is the following: Given a workflow, abstractly modeled by a relation R, a privacy requirement \Gamma and costs associated with data. The owner of the workflow decides which data (attributes) to hide, and provides the user with a view R' which is the projection of R over attributes which have not been hidden. The goal is to minimize the cost of hidden data while guaranteeing that individual modules are \Gamma -private. We call this the "secureview" problem. We formally define the problem, study its complexity, and offer algorithmic solutions

    A Performance Analysis of Timed Synchronous Communication Primitives

    Get PDF
    Two algorithms for timed synchronous communication between a single sender and single receiver have recently appeared in the literature. Each weakens the definition of correct timed synchronous communication in a different way, and exhibits a different undesirable behavior. In this paper, their performance is analyzed, and their sensitivity to various parameters is discussed. These parameters include how long the processes are willing to wait for communication to be successful, how well synchronized the processes are, the assumed upper bound on message delay, and the actual end-to-end message delay distribution. We conclude by discussing the fault tolerance of the algorithms, and propose a mixed strategy that avoids some of the performance problems

    Updating Complex Value Databeses

    Get PDF
    Query languages and their optimizations have been a very important issue in the database community. Languages for updating databases, however, have not been studied to the same extent, although they are clearly important since databases must change over time. The structure and expressiveness of updates is largely dependent on the data model. In relational databases, for example, the update language typically allows the user to specify changes to individual fields of a subset of a relation that meets some selection criterion. The syntax is terse, specifying only the pieces of the database that are to be altered. Because of its simplicity, most of the optimizations take place in the internal processing of the update rather than at the language level. In complex value databases, the need for a terse and optimizable update language is much greater, due to the deeply nested structures involved. Starting with a query language for complex value databases called the Collection Programming Language (CPL), we describe an extension called CPL+ which provides a convenient and intuitive specification of updates on complex values. CPL is a functional language, with powerful optimizations achieved through rewrite rules. Additional rewrite rules are derived for CPL+ and a notion of deltafication is introduced to transform complete updates, expressed as conventional CPL expressions, into equivalent update expressions in CPL+. As a result of applying these transformations, the performance of complex updates can increase substantially

    Adding Time to Synchronous Process Communications

    Get PDF
    In distributed real-time systems, communicating processes cannot be delayed for arbitrary amounts of time while waiting for messages. Thus, communication primitives used for real-time programming usually allow the inclusion of a deadline or timeout to limit potential delays due to synchronization. This paper interprets timed synchronous communication as having absolute deadlines. Various ways of implementing deadlines are discussed, and two useful timed synchronous communication problems are identified which differ in the number of of participating senders and receivers and type of synchronous communication. For each problem, a simple algorithm is presented and shown to be correct. The algorithms are shown to guarantee maximal success and to require the smallest delay intervals during which processes wait for synchronous communication. We also evaluate the number of messages used to reach agreement
    • …
    corecore